Scraping Huizenzoeker.nl to Analyse the Dutch Housing Market


Which places in the Netherlands are hit hardest by the Dutch housing crisis, and which are hit the least?

The housing crisis is currently one of the most prominent societal challenges in the Netherlands. The Dutch housing market is both highly competitive and inaccessible because of a supply shortage, which in turn leads to long waiting lists for social housing. Transaction prices are rising sharply: CBS (Centraal Bureau voor de Statistiek, 2021) reported that prices in the first quarter of 2021 were 11.3% higher than a year earlier, well above the average increase in Europe. Many prospective buyers have to overbid on listings in order to secure a place to live. A major problem is that the size of a mortgage is determined by the appraisal value of the house, so many buyers have to put their own capital into the purchase. This makes it very challenging for new entrants to the market (such as first-time buyers and young professionals) to rent or buy a house or apartment, especially as many are still paying off large student debts. It appears that only 3% of first-time buyers are financially able to buy a home without getting themselves into serious financial trouble.

We considered several housing sites for our project, of which Huizenzoeker.nl appeared to be the most suitable. This website offers a clear view of the Dutch housing market with a wide range of listings and displays an extensive amount of information per listing and per neighbourhood. Funda.nl is currently the largest housing platform in the Netherlands, but the site is not usable for this project: Funda has secured its data with protective measures to guard against competitor sites. Zoekallehuizen.nl also offers a large range of listings, but could not provide information that is important for researching the housing crisis, e.g. overbidding percentages. Remax.nl is a large housing website as well, yet it mainly focuses on houses in other countries, such as Spain and Belgium. As we set out to analyse the Dutch housing market because of the severity of the current crisis, Remax.nl did not meet our needs.

The Huizenzoeker data used in this project is available at Huizenzoeker.nl.

Repository overview


  • readme.txt = documentation of the project.
  • docs/ = stores any supporting files for the documentation.
  • data/ = stores the raw data files.
  • src/ = stores the source code used to collect the data and to generate the statistics/insights documented in the README.

Contributors


This is a repository for the course Online Data Collection and Management at Tilburg University as part of the Master’s program ‘Marketing Analytics’, used for the team project of group 3.

Members of our team:

Documentation


1 Motivation


Our data source, Huizenzoeker.nl, fits well into the data aggregator and low scale/scope category of Figure W3.1: Data Source Exploration of the ‘Fields of Gold’ paper (add a reference). The data available on Huizenzoeker.nl is less detailed but combines data from multiple platforms (multi-platform data), and it has regional rather than global coverage: it only presents housing market data for the municipalities in the Netherlands, so it is mostly useful for those living or planning to live in the Netherlands. As for content type, the site seems to fit best in the e-commerce category, although this classification is tentative.


Although Huizenzoeker.nl is less well known and has fewer users than other housing sites such as Funda (including other regional ones), it offers the richer data and novel measures we need to answer our research questions. Rather than gathering data from numerous primary data providers in the housing sector ourselves, we could use this data aggregator to collect multi-platform data far more efficiently.

1.1 For what purpose was the dataset created? Was there a specific task in mind? Was there a specific gap that needed to be filled? Please provide a description.

As the severity of the housing crisis and the shortage of listings differ across the country, we aimed to create a dataset that represents the current housing market for each municipality in every province of the Netherlands. It clarifies which places are hit hardest by the crisis and which the least. With this dataset, we may help first-time buyers and young professionals in their search to buy or rent a house by showing them where they would have the highest chances. Furthermore, it provides them with insights into recent price developments of listings in a certain area, which helps them in negotiations about the purchase price. The dataset therefore gives consumers data in addition to the information offered by their broker (e.g. direct information from the Kadaster). Some datasets on the Dutch housing market already existed, but these did not specifically focus on the overbidding aspect of the current crisis, which forms an essential part of our research. In addition, instead of focusing only on certain parts of the Netherlands, we preferred to cover all municipalities to get a more complete picture of the current state of the housing crisis. By focusing on municipalities, the units we analyse are small enough to examine the Dutch housing market locally (as opposed to focusing only on provinces), yet large enough to keep the dataset manageable (as opposed to focusing on every house that is for sale in the Netherlands).

1.2 Who created this dataset (e.g. which team, research group) and on behalf of which entity (e.g., company, institution, organization)?

Huizenzoeker.nl is an independent platform that is not influenced or moderated by estate agents, as it aims to inform its clients honestly and with reliable information. It is an aggregator site that collects information from different public sources, such as JAAP.nl. However, as Huizenzoeker.nl is owned by Spotzi, a big data visualization specialist that focuses on visualizing and analysing spatial data, the Huizenzoeker team also provides much data itself. From Spotzi, it retrieves data on, for example, the value of the listings and the development of housing prices. This is beneficial to our scraping project, as it results in long lists of information for each listing, municipality, province, etc. The site therefore provides specifics not only on the houses themselves, like every other site, but also on the neighbourhood, the mean income in the municipality, the distance to the closest supermarket, etc. The platform states that it is a partner of JAAP.nl and Huislijn.nl, yet it does not have explicit consent from JAAP.nl to show all information that is displayed on JAAP.nl (which seems quite contradictory). In turn, databanks like JAAP.nl get their data from other sites, such as Funda.nl.

1.3 Who funded the creation of the dataset? If there is an associated grant, please provide the name of the grantor and the grant name and number.

The dataset is funded by advertisers on the site. Advertisers can target visitors through the filter options of the site, which allow targeting based on home characteristics, region and price range. Spotzi has created various profiles from the data it compiled, e.g. starters (young and ambitious), families with children (nest builder) and kids away from home (thriving fifties), so advertisers can target a very specific audience within a certain zip code. These profiles can also be used on external sites through the Rearch extension function.

2 Composition


2.1 What do instances that comprise the dataset represent (e.g. documents, photos, people, countries)? Are there multiple types of instances (e.g. movies, users, ratings; people and interactions between them; nodes and edges)? Please provide a description.

The instances that comprise the dataset are the municipalities of the Netherlands. The underlying units are houses, but in our dataset the housing data is grouped at municipality level, so the values summarize the housing market of a municipality as averages (step 3 of the navigation path below). The municipalities are connected to each other by the province they are in; all municipalities therefore also belong to a larger type of instance, the province (step 2 of the navigation path below).

The following screenshots represent a brief navigation path:

  1. The homepage of Huizenzoeker.nl. From here, one can scroll down to the ‘Woningmarkt’-section and navigate to one of the province pages.
Screenshot 2021-10-12 at 11 11 32
  2. This is an example of a province page, in this case the province Noord-Brabant. Notice how the URL is extended with ‘woningmarkt/noord-brabant/’. From this page, one can scroll down and select one of the municipality pages that exist for the province in question.
Screenshot 2021-10-12 at 11 11 44
  3. This is an example of a municipality page within the province Noord-Brabant, in this case the municipality Tilburg. Notice how the URL is extended with ‘tilburg/’.


Screenshot 2021-10-12 at 11 12 02

The goal of this project has been to scrape information per municipality and, for completeness, per province. Therefore, pages like the ones displayed under steps 2 and 3 have been used to obtain statistical housing-related measures per municipality and province.

2.2 How many instances are there in total (of each type, if appropriate?)

If every municipality is seen as an instance, there are 352 municipalities in total, spread over the 12 provinces of the Netherlands.

  • Groningen = 10 municipalities
  • Friesland = 18 municipalities
  • Drenthe = 12 municipalities
  • Overijssel = 25 municipalities
  • Flevoland = 6 municipalities
  • Gelderland = 51 municipalities
  • Noord-Holland = 47 municipalities
  • Zuid-Holland = 52 municipalities
  • Utrecht = 26 municipalities
  • Limburg = 31 municipalities
  • Noord-Brabant = 61 municipalities
  • Zeeland = 13 municipalities
  • Total = 352 municipalities.

2.3 Does the dataset contain all possible instances or is it a sample (not necessarily random) of instances from a larger set? If the dataset is a sample, then what is the larger set? Is the sample representative of the larger set (e.g., geographic coverage)? If so, please describe how this representativeness was validated/verified. If it is not representative of the larger set, please describe why not (e.g. to cover a more diverse range of instances, because instances were withheld or unavailable).

All house listings in the Netherlands are covered by our dataset (the full population), although their data is grouped and reported as an average for each municipality. As every listing is located in a municipality and all municipalities are included, the dataset has full geographic coverage.

2.4 What data does each instance consist of? ‘Raw’ data (e.g. unprocessed text or images) or features? In either case, please provide a description.

To clarify, each municipality, or instance, has its own page, and each province has its own page as well, which is the parent of several municipality pages (as described in question 2.2). The structure of the province and municipality pages is almost identical. For this question, the illustration is based on one of the municipality pages, Tilburg (Noord-Brabant).

First, all pages display a map of the Netherlands with their specific location on the map. Next, all contain a link to all the houses that are for sale, followed by a link to all houses that are for rent. A subsequent link directs to the most expensive houses of the municipality/province in question.

Screenshot 2021-10-12 at 12 24 53

Next, each page displays 4 ‘trend’ statistics. Each of the 4 numbers is accompanied by a percentage that reflects the change in the statistic compared to the month before. The first trend is the average selling price of a house within the municipality/province. The second is the number of houses sold in the past month. The third reflects the average selling price per square metre. The fourth indicates the average overbidding percentage within the municipality/province in question. These trends are of high importance to our project.

Screenshot 2021-10-12 at 12 27 31

Moreover, all pages include histograms that show price and housing supply trends. Additionally, a link is included to access more information about the housing market in question.

Screenshot 2021-10-12 at 12 35 16

Next, a section is shown in which several questions are answered in unprocessed text. The first questions cover exactly the same information as the 4 trend statistics. The last ones, however, cover the population number and the population growth/decline compared to the year before. This population-related information is, again, of high relevance later in our project.

Screenshot 2021-10-12 at 12 36 56

Furthermore, a pie chart showing the average age distribution in the province/municipality is included. Moreover, a statistic on average disposable income is included, which again will be important later in our project.

Screenshot 2021-10-12 at 12 41 48

Finally, at the bottom of the page random houses that are for sale/rent are displayed, followed by links that navigate to a ‘child’-page (e.g. from province page to municipality page).

2.5 Is there a label or target associated with each instance? If so, please provide a description.

From each province in the Netherlands, we intend to scrape all corresponding municipalities. For a province, the associated URL is for example ‘https://www.huizenzoeker.nl/woningmarkt/noord-brabant/’, which extends to ‘https://www.huizenzoeker.nl/woningmarkt/noord-brabant/tilburg/’ for the municipality Tilburg. So, each instance that we want to scrape corresponds to its own URL.

Moreover, within the code we wrote, we extracted the municipality or province name for each of these URLs by scraping the page title and removing the word ‘Woningmarkt’ from it. We thus replaced the official label (the URL) with an artificial one for clarity, e.g. the municipality Tilburg can now be identified through the label ‘Tilburg’ instead of its URL.
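The URL structure and the label extraction can be sketched as follows. This is a minimal illustration rather than the project code: the function names `build_url` and `label_from_title` are hypothetical, and the example title ‘Woningmarkt Tilburg’ is an assumption about what a page title looks like.

```python
import re

BASE_URL = "https://www.huizenzoeker.nl/woningmarkt/"


def build_url(province, municipality=None):
    """Build a province URL, or a municipality URL when a municipality is given."""
    parts = [province] if municipality is None else [province, municipality]
    return BASE_URL + "/".join(parts) + "/"


def label_from_title(title):
    """Drop the word 'Woningmarkt' from a page title and tidy the whitespace."""
    without_word = re.sub(r"woningmarkt", "", title, flags=re.IGNORECASE)
    return " ".join(without_word.split())


print(build_url("noord-brabant"))
print(build_url("noord-brabant", "tilburg"))
print(label_from_title("Woningmarkt Tilburg"))  # assumed title format
```

Keeping the label separate from the URL means the same record can be identified by a readable name while the URL remains available for re-scraping.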

2.6 Is any information missing from individual instances? If so, provide a description, explaining why this information is missing (e.g., because it was unavailable). This does not include intentionally removed information, but might include, e.g. redacted text.

For our purposes of scraping Huizenzoeker.nl, no information that we wanted to include in our dataset is missing, with one exception: we also wanted to scrape the graphs, but could not, because the data behind the graphs is not given in the page source code.

a) are there guarantees that they will exist and remain constant over time;

Huizenzoeker.nl states that, for over 10 years, it has made every effort to ensure that the website functions properly and is kept permanently accessible, for reputational reasons. Huizenzoeker.nl edits the information offered on its site with the greatest possible care and devotes the same care to the composition of the site. However, it legally cannot guarantee the correctness and completeness of the data shown, as imperfections may occur. Moreover, Huizenzoeker.nl can adapt the website wherever and whenever it pleases; no restrictions hold. This information has been retrieved from the disclaimer section of the official Huizenzoeker website.

b) are there official archival versions of the complete datasets (i.e. including the external resources as they existed at the time the dataset was created).

Possibly, for the own use of Huizenzoeker.nl. However, no official archival versions of the complete datasets are available to us as public users of Huizenzoeker.nl. For the data we scrape to answer our research objective, Huizenzoeker.nl displays current rather than archival data. Data for previous years must exist, since the site shows, for instance, ‘prijsontwikkelingen’ (price developments) over the last couple of years, but it is not published in full. The data we scrape from the municipality pages is updated every month, so scraping a page does not give direct access to the figures or averages of previous months.

2.11 Does the dataset contain data that, if viewed directly, might be offensive, insulting, threatening, or might otherwise cause anxiety? If so, please describe why.

No, the data can in no way be perceived as offensive, insulting, or threatening.

2.12 Does the dataset relate to people? If not, you may skip the remaining questions in this section.

Not applicable.

2.13 Does the dataset identify any subpopulations (e.g., by age, gender)? If so, please describe how these subpopulations are identified and provide a description of their respective distributions within the dataset.

Within our dataset we only scrape the number of inhabitants and the average disposable income per municipality/province. At most, municipalities could be grouped by their level of average disposable income. The dataset itself describes houses rather than people and does not target specific buyer groups (e.g. only starters, or only wealthy buyers of villas), so no further subpopulations are identified.

2.14 Is it possible to identify individuals (i.e., one or more natural persons), either directly or indirectly (i.e., in combination with other data) from the dataset? If so, please describe how.

Not applicable.

2.15 Does the dataset contain data that might be considered sensitive in any way (e.g. data that reveals racial or ethnic origins, sexual orientations, religious beliefs, political opinions or union memberships, or locations; financial or health data; biometric or genetic data; forms of government identification, such as social security numbers; criminal history)? If so, please provide a description.

Not applicable.

3 Collection process


3.1 How was the data associated with each instance acquired? Was the data directly observable (e.g., raw text, movie ratings), reported by subjects (e.g., survey responses), or indirectly inferred/derived from other data (e.g., part-of-speech tags, model-based guesses for age or language)? If data was reported by subjects or indirectly inferred/derived from other data, was the data validated/verified? If so, please describe how.

Using the Selenium package, together with ChromeDriverManager from the webdriver manager package, we set up a Chrome webdriver. With this webdriver we were eventually able to scrape all municipality links on the Huizenzoeker website. At first, we wanted to work with BeautifulSoup only, as the package is user-friendly and fast. Despite these benefits, in our attempts it could only extract the municipality links for one province at a time, as opposed to all links for all 12 provinces at once. Selenium, which is designed to automate the testing of web applications, made it possible for us to finish the job.
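The link-collection step can be illustrated with a small sketch that filters municipality links out of the HTML of a province page. It is an assumption-laden illustration, not the project code: it uses only the Python standard library instead of Selenium or BeautifulSoup, the class and function names are hypothetical, and the sample HTML (including the ‘/koop/’ link) is made up. With Selenium, the HTML string would typically come from `driver.page_source`.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

BASE_URL = "https://www.huizenzoeker.nl"


class MunicipalityLinkParser(HTMLParser):
    """Collect links of the form /woningmarkt/<province>/<municipality>/."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        url = urljoin(BASE_URL, href)
        segments = [s for s in urlparse(url).path.split("/") if s]
        # Municipality pages have exactly three path segments; skip duplicates.
        if len(segments) == 3 and segments[0] == "woningmarkt" and url not in self.links:
            self.links.append(url)


def municipality_links(page_source):
    """Return all municipality URLs found in the HTML of one province page."""
    parser = MunicipalityLinkParser()
    parser.feed(page_source)
    return parser.links


# Made-up snippet standing in for the HTML of a province page.
sample_html = """
<a href="/woningmarkt/noord-brabant/tilburg/">Tilburg</a>
<a href="/woningmarkt/noord-brabant/breda/">Breda</a>
<a href="/woningmarkt/noord-brabant/">Noord-Brabant</a>
<a href="/koop/">Koopwoningen</a>
"""
print(municipality_links(sample_html))
```

Calling `municipality_links` once per province page and concatenating the results would give the full list of municipality URLs.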

The next step was to extract all the variables described in question 2.8, for which one large script was written. At the start of this script, we made sure that all output is saved to a JSON file. Next, one big for-loop iterates over all municipality pages, with the code sleeping for 5 seconds in each iteration. Within the loop we used the BeautifulSoup package, which worked fine for extracting the variables; as it is the simpler and quicker method, we chose to stick with it for this step. We first defined the variable intended for identification purposes: the municipality name. The remaining variables were created using many ‘if’-‘else’ statements, tailoring each variable to the corresponding HTML that can be seen when inspecting the municipality webpage. Furthermore, irrelevant characters/words were dropped to make the output easier to understand. At the end of the script, all the variables are appended to a list.
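The overall shape of such a loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the project script: `scrape_all`, `fetch` and `parse` are hypothetical names, the download and parsing steps are passed in as functions so that the sketch does not depend on any particular library, and the demo uses made-up stand-ins instead of real pages.

```python
import json
import os
import tempfile
import time


def scrape_all(urls, fetch, parse, delay=5.0, out_path="municipalities.json"):
    """Visit every URL, parse it, pause between requests and save the results as JSON.

    `fetch` takes a URL and returns HTML; `parse` takes HTML and returns a dict.
    """
    rows = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay)  # pause between consecutive pages
        record = parse(fetch(url))
        record["url"] = url
        rows.append(record)
    with open(out_path, "w", encoding="utf-8") as f:
        json.dump(rows, f, ensure_ascii=False, indent=2)
    return rows


# Demo with made-up stand-ins for the real download and parsing steps.
fake_pages = {"url-tilburg": "Tilburg", "url-breda": "Breda"}
demo_path = os.path.join(tempfile.mkdtemp(), "demo.json")
demo_rows = scrape_all(
    list(fake_pages),
    fetch=lambda url: fake_pages[url],
    parse=lambda html: {"municipality": html},
    delay=0,  # no waiting in the demo
    out_path=demo_path,
)
print(demo_rows)
```

Passing `fetch` and `parse` as arguments keeps the loop itself easy to test without touching the website.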

Using the ‘pandas’ package we were able to convert this list with variables into a large table (dataframe) containing all variables per municipality. This dataframe, in turn, is converted into a csv file.
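As an illustration of this step, the sketch below turns a list of records into CSV text. It deliberately uses the standard-library `csv` module to stay dependency-free; with pandas, the equivalent would be along the lines of `pd.DataFrame(rows).to_csv(path, index=False)`. The function name `rows_to_csv`, the column names and the values are hypothetical.

```python
import csv
import io


def rows_to_csv(rows, fieldnames):
    """Write a list of dicts as CSV text, one line per municipality.

    Keys that are missing from a record are written as 'NA'.
    """
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="NA", lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up records; the real dataset has one record per municipality.
example_rows = [
    {"municipality": "Tilburg", "avg_price": 350000},
    {"municipality": "Breda"},  # no price available
]
print(rows_to_csv(example_rows, ["municipality", "avg_price"]))
```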

To obtain summary statistics, we did the following. As most variables were stored as character strings while they should have been numeric, we exported the final dataframe to R to change these data types. After that, we exported it again as a CSV file and used it to generate summary statistics: count, mean, std, min, max, 25%, 50%, 75%.
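The same eight summary statistics can be computed for a single column as sketched below. This is an illustration only: in the project the type conversion was done in R, whereas this sketch converts strings to numbers in Python with the standard-library `statistics` module. The function name `describe` and the example values are hypothetical. The ‘inclusive’ quantile method is chosen because it interpolates linearly between data points, which matches the default behaviour of `describe()` in pandas.

```python
import statistics


def describe(values):
    """Summarise a column: count, mean, std, min, quartiles and max, skipping 'NA'."""
    numbers = sorted(float(v) for v in values if v != "NA")
    q1, median, q3 = statistics.quantiles(numbers, n=4, method="inclusive")
    return {
        "count": len(numbers),
        "mean": statistics.mean(numbers),
        "std": statistics.stdev(numbers),  # sample standard deviation
        "min": numbers[0],
        "25%": q1,
        "50%": median,
        "75%": q3,
        "max": numbers[-1],
    }


# Made-up column of scraped values stored as strings.
print(describe(["350000", "420000", "NA", "275000", "510000"]))
```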

The last part of the Jupyter notebook shows how the same steps as described above are used to scrape the URLs at province level, as opposed to municipality level.

All the data we scraped from the Huizenzoeker platform was data that was directly observable in the form of raw text.

3.2 What mechanisms or procedures were used to collect the data (e.g., hardware apparatus or sensor, manual human curation, software program, software API)? How were these mechanisms or procedures validated?

We scraped the data using Python in Jupyter Notebooks. By loading the packages BeautifulSoup, Selenium, requests, re, pandas, time, webdriver manager, and json, we had access to the functions needed for our specific web scraping steps.

Huizenzoeker.nl does not provide an official software API (anymore), so we scraped the data by writing code ourselves.

3.3 If the dataset is a sample from a larger set, what was the sampling strategy (e.g., deterministic, probabilistic with specific sampling probabilities)?

Technically, we have taken the entire population, and not a sample, to conduct our project with. We took all the municipality pages as input, and not a portion of them.

Yet, conceptually, one could say we have taken a sample. A single unit would logically be a single house. However, as the statistics we were after were only available as averages on the municipality pages, we took the municipality pages as single units. A municipality page consists of averages over all the individual houses in that region. In that sense, aggregation by municipality is the sampling strategy applied.

3.4 Who was involved in the data collection process (e.g., students, crowdworkers, contractors) and how were they compensated (e.g., how much were crowdworkers paid)?

Only the team members of this project were involved in the data collection process.

3.5 Over what timeframe was the data collected? Does this timeframe match the creation timeframe of the data associated with the instances (e.g., recent crawl of old news articles)? If not, please describe the timeframe in which the data associated with the instances was created.

At the time of scraping, Huizenzoeker.nl showed the housing market data of October 2021, which was the most recent housing market data. Huizenzoeker.nl shows this most recent data because the housing market changes every month (e.g., houses are sold, new houses are offered, asking prices may be overbid more heavily in one month than in another, etc.).

3.7 Does the dataset relate to people? If not, you may skip the remaining questions in this section.

Not applicable.

3.8 Did you collect the data from the individuals in question directly, or obtain it via third parties or other sources (e.g., websites)?

Not applicable.

4 Preprocessing, cleaning, labeling


4.1 Was any preprocessing/cleaning/labeling of the data done (e.g., discretization or bucketing, tokenization, part-of-speech tagging, SIFT feature extraction, removal of instances, processing of missing values)? If so, please provide a description. If not, you may skip the remainder of the questions in this section.

First of all, all the values of the variables have been cleaned so that they only contain a numeric value or percentage (no additional words, and only consistent punctuation). This means removing HTML tags, stripping out unnecessary characters and retaining relevant substrings only. To achieve this, we used regular expressions (regex) to pre-process the textual data. When no numeric value exists for a specific municipality, the variable in question is set to ‘NA’. Furthermore, all the variables have been assigned a clear label, such that the numeric values are given a meaning; for example, each value is labelled with its province or municipality and with the variable it belongs to. Additionally, all the variables have been displayed in a table against all the municipalities/provinces as a first step in preprocessing.
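A regex-based cleaning step of this kind can be sketched as follows. It is an illustration, not the project code: the function name `clean_number` is hypothetical, and the sketch assumes Dutch number formatting (a dot as thousands separator and a comma as decimal separator), with example strings that are made up.

```python
import re


def clean_number(raw):
    """Turn scraped text such as '€ 350.000' or '5,2%' into a float.

    Returns 'NA' when the text is missing or contains no number.
    """
    if raw is None:
        return "NA"
    text = re.sub(r"<[^>]+>", "", raw)  # drop any leftover HTML tags
    match = re.search(r"-?\d[\d.]*(?:,\d+)?", text)
    if match is None:
        return "NA"
    # Remove thousands separators, then turn the decimal comma into a dot.
    number = match.group(0).replace(".", "").replace(",", ".")
    return float(number)


for example in ["€ 350.000", "5,2%", "<span>-1,5%</span>", "n.v.t.", None]:
    print(repr(example), "->", clean_number(example))
```

Returning the same ‘NA’ marker for every missing value makes it straightforward to convert the column to a numeric type afterwards.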


5 Uses


5.1 Has the dataset been used for any tasks already? If so, please provide a description.

We used our dataset in RStudio to create some plots and figures of the data we collected. We did this to give insights into how we would compare the municipalities for each province, and the data between the provinces.

5.3 What (other) tasks could the dataset be used for?

Broadly speaking, this dataset can be used to help (future) inhabitants of the Netherlands find their ideal home. With our data, a person could find the best municipality to live in given their specific circumstances (e.g. a specific disposable income level), find a region where value for money appears to be high, strengthen their position in price negotiations, find out what the norm is in terms of overbidding for each municipality, and more.

5.5 Are there tasks for which the dataset should not be used? If so, please provide a description.

The dataset can be used for any matters regarding the housing market in the Netherlands, at municipality level as well as province level. For anything outside of this topic, the dataset has no use.